21 research outputs found

    Combining predictions from linear models when training and test inputs differ

    Methods for combining predictions from different models in a supervised learning setting must somehow estimate or predict the quality of a model's predictions at unknown future inputs. Many of these methods (often implicitly) assume that the test inputs are identical to the training inputs, which is seldom reasonable. Because they fail to take into account that prediction will generally be harder for test inputs that did not occur in the training set, these methods tend to select overly complex models. Based on a novel, unbiased expression for KL divergence, we propose XAIC and its special case FAIC as versions of AIC intended for prediction that use different degrees of knowledge of the test inputs. Both methods substantially differ from and may outperform all the known versions of AIC even when the training and test inputs are iid, and are especially useful for deterministic inputs and under covariate shift. Our experiments on linear models suggest that if the test and training inputs differ substantially, then XAIC and FAIC predictively outperform AIC, BIC and several other methods, including Bayesian model averaging. Comment: 12 pages, 2 figures. To appear in Proceedings of the 30th Conference on Uncertainty in Artificial Intelligence (UAI 2014). This version includes the supplementary material (regularity assumptions, proofs).
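
    The XAIC and FAIC formulas are not given in this abstract; as a point of reference, the sketch below (hypothetical data; Python) shows the standard AIC baseline the paper compares against: fit Gaussian polynomial regressions of increasing degree and pick the one minimizing AIC = 2k - 2 log-likelihood, a rule that never looks at where the test inputs lie.

        import numpy as np

        rng = np.random.default_rng(0)

        # Hypothetical training sample; under covariate shift the test inputs
        # would lie elsewhere, which this criterion cannot see.
        x_train = rng.normal(0.0, 0.5, 50)
        y_train = np.sin(x_train) + rng.normal(0, 0.1, 50)

        def aic(x, y, degree):
            # AIC = 2k - 2 max log-likelihood for a Gaussian polynomial model
            X = np.vander(x, degree + 1)
            beta, *_ = np.linalg.lstsq(X, y, rcond=None)
            rss = np.sum((y - X @ beta) ** 2)
            n, k = len(y), degree + 2  # coefficients plus noise variance
            loglik = -0.5 * n * (np.log(2 * np.pi * rss / n) + 1)
            return 2 * k - 2 * loglik

        best = min(range(1, 10), key=lambda d: aic(x_train, y_train, d))
        print("AIC-selected degree:", best)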

    Inconsistency of Bayesian Inference for Misspecified Linear Models, and a Proposal for Repairing It

    We empirically show that Bayesian inference can be inconsistent under misspecification in simple linear regression problems, both in a model averaging/selection and in a Bayesian ridge regression setting. We use the standard linear model, which assumes homoskedasticity, whereas the data are heteroskedastic, and observe that the posterior puts its mass on ever more high-dimensional models as the sample size increases. To remedy the problem, we equip the likelihood in Bayes' theorem with an exponent called the learning rate, and we propose the Safe Bayesian method to learn the learning rate from the data. SafeBayes tends to select small learning rates as soon as the standard posterior is not 'cumulatively concentrated', and its results on our data are quite encouraging. Comment: 70 pages, 20 figures.
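
    In symbols, the learning rate eta enters as an exponent on the likelihood (standard generalized-Bayes notation; details in the paper may differ):

        \pi(\theta \mid z^n, \eta) \;\propto\; \pi(\theta)\, p(z^n \mid \theta)^{\eta}

    Setting \eta = 1 recovers the standard posterior, while \eta < 1 downweights the data relative to the prior, which regularizes more.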

    Better predictions when models are wrong or underspecified

    Many statistical methods rely on models of reality in order to learn from data and to make predictions about future data. By necessity, these models usually do not match reality exactly, but are either wrong (none of the hypotheses in the model provides an accurate description of reality) or underspecified (the hypotheses in the model describe only part of the data). In this thesis, we discuss three scenarios involving models that are wrong or underspecified. In each case, we find that standard statistical methods may fail, sometimes dramatically, and present different methods that continue to perform well even if the models are wrong or underspecified. The first two of these scenarios involve regression problems and investigate AIC (Akaike's Information Criterion) and Bayesian statistics. The third scenario has the famous Monty Hall problem as a special case, and considers the question of how we can update our belief about an unknown outcome given new evidence when the precise relation between outcome and evidence is unknown.

    Adapting AIC to conditional model selection

    In statistical settings such as regression and time series, we can condition on observed information when predicting the data of interest. For example, a regression model explains the dependent variables $y_1, \ldots, y_n$ in terms of the independent variables $x_1, \ldots, x_n$. When we ask such a model to predict the value of $y_{n+1}$ corresponding to some given value of $x_{n+1}$, that prediction's accuracy will vary with $x_{n+1}$. Existing methods for model selection do not take this variability into account, which often causes them to select inferior models. One widely used method for model selection is AIC (Akaike's Information Criterion \cite{Akaike}), which is based on estimates of the KL divergence from the true distribution to each model. We propose an adaptation of AIC that takes the observed information into account when estimating the KL divergence, thereby getting rid of a bias in AIC's estimate.
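
    For reference, the criterion being adapted is the textbook AIC (formula not quoted from the abstract): for a model with $k$ free parameters and maximum-likelihood estimate $\hat\theta$,

        \mathrm{AIC} = -2 \log p(y_1, \ldots, y_n \mid \hat\theta) + 2k,

    and the model minimizing this quantity is selected; up to an additive constant, it serves as an approximately unbiased estimate of twice the expected KL divergence from the true distribution to the fitted model.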

    Graphical Representations for Algebraic Constraints of Linear Structural Equations Models

    The observational characteristics of a linear structural equation model can be effectively described by polynomial constraints on the observed covariance matrix. However, these polynomials can be exponentially large, making them impractical for many purposes. In this paper, we present a graphical notation for many of these polynomial constraints. The expressive power of this notation is investigated both theoretically and empirically. Comment: To appear in the proceedings of the 11th International Conference on Probabilistic Graphical Models (PGM 2022).
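
    A classic instance of such a polynomial constraint is the vanishing tetrad implied by a single latent factor (a standard example, not taken from the paper). This sketch simulates four indicators of one latent variable and checks that a tetrad difference of sample covariances is near zero:

        import numpy as np

        rng = np.random.default_rng(0)

        # One latent factor L with four noisy indicators X_i = lam_i * L + eps_i
        n, lam = 200_000, np.array([0.9, 0.7, 0.8, 0.6])
        L = rng.normal(size=n)
        X = np.outer(L, lam) + 0.3 * rng.normal(size=(n, 4))
        S = np.cov(X, rowvar=False)

        # In the population, Cov(X_i, X_j) = lam_i * lam_j for i != j, so
        # s12 * s34 - s13 * s24 = 0 (a vanishing tetrad).
        tetrad = S[0, 1] * S[2, 3] - S[0, 2] * S[1, 3]
        print(f"tetrad difference: {tetrad:+.5f} (zero up to sampling noise)")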

    Inconsistency of Bayesian inference for misspecified linear models, and a proposal for repairing it

    We empirically show that Bayesian inference can be inconsistent under misspecification in simple linear regression problems, both in a model averaging/selection and in a Bayesian ridge regression setting. We use the standard linear model, which assumes homoskedasticity, whereas the data are heteroskedastic (though, significantly, there are no outliers). As sample size increases, the posterior puts its mass on worse and worse models of ever higher dimension. This is caused by hypercompression, the phenomenon that the posterior puts its mass on distributions that have much larger KL divergence from the ground truth than their average, i.e. the Bayes predictive distribution. To remedy the problem, we equip the likelihood in Bayes' theorem with an exponent called the learning rate, and we propose the SafeBayesian method to learn the learning rate from the data. SafeBayes tends to select small learning rates, and regularizes more, as soon as hypercompression takes place. Its results on our data are quite encouraging.
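
    A minimal sketch of the SafeBayes idea under simplifying assumptions of mine (a one-parameter Gaussian-mean model with conjugate updates; the paper's regression setting and exact loss are richer): sequentially predict each observation from the eta-generalized posterior on the preceding ones, and select the learning rate with the smallest cumulative log loss.

        import numpy as np

        rng = np.random.default_rng(0)

        # Heteroskedastic data fed to a homoskedastic N(mu, 1) model -- a toy
        # stand-in for the paper's experiments, not a reproduction of them.
        x = rng.uniform(-1, 1, 500)
        z = rng.normal(0.0, np.where(np.abs(x) < 0.5, 0.1, 2.0))

        def cumulative_log_loss(z, eta, tau0_sq=100.0, sigma_sq=1.0):
            # Tempered conjugate updates: with learning rate eta, each point
            # contributes eta / sigma_sq of precision to the posterior on mu.
            prec, mean_num, loss = 1.0 / tau0_sq, 0.0, 0.0
            for zi in z:
                mu, post_var = mean_num / prec, 1.0 / prec
                pred_var = sigma_sq + post_var  # posterior predictive variance
                loss += 0.5 * (np.log(2 * np.pi * pred_var)
                               + (zi - mu) ** 2 / pred_var)
                prec += eta / sigma_sq
                mean_num += eta * zi / sigma_sq
            return loss

        etas = [2.0 ** -k for k in range(8)]  # candidate learning rates
        best = min(etas, key=lambda e: cumulative_log_loss(z, e))
        print("selected learning rate:", best)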

    Causal Entropy and Information Gain for Measuring Causal Control

    Artificial intelligence models and methods commonly lack causal interpretability. Despite the advancements in interpretable machine learning (IML) methods, they frequently assign importance to features that lack causal influence on the outcome variable. Selecting causally relevant features among those identified as relevant by these methods, or even before model training, would offer a solution. Feature selection methods utilizing information theoretical quantities have been successful in identifying statistically relevant features. However, the information theoretical quantities they are based on do not incorporate causality, rendering them unsuitable for such scenarios. To address this challenge, this article proposes information theoretical quantities that incorporate the causal structure of the system, which can be used to evaluate the causal importance of features for some given outcome variable. Specifically, we introduce causal versions of entropy and mutual information, termed causal entropy and causal information gain, which are designed to assess how much control a feature provides over the outcome variable. These newly defined quantities capture changes in the entropy of a variable resulting from interventions on other variables. Fundamental results connecting these quantities to the existence of causal effects are derived. The use of causal information gain in feature selection is demonstrated, highlighting its superiority over standard mutual information in revealing which features provide control over a chosen outcome variable. Our investigation paves the way for the development of methods with improved interpretability in domains involving causation. Comment: 16 pages. Accepted at the third XI-ML workshop of ECAI 2023. To appear in the Springer CCIS book series.
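
    To make "entropy under interventions" concrete, here is a toy sketch (the model and the uniform intervention policy are assumptions of mine; the paper's definitions are more general). In a structural model Y = X XOR N with exogenous noise N, the entropy of Y under do(X = x) is exactly the noise entropy, and averaging over an intervention policy yields a causal analogue of conditional entropy:

        import numpy as np

        def entropy(p):
            p = np.asarray(p, dtype=float)
            p = p[p > 0]
            return -np.sum(p * np.log2(p))

        # Toy SCM: Y = X XOR N, noise N ~ Bernoulli(0.1). Under do(X=x) the
        # distribution of Y is determined by the noise alone.
        def p_y_given_do_x(x, p_noise=0.1):
            p1 = (1 - p_noise) * x + p_noise * (1 - x)  # P(Y = 1 | do(X=x))
            return np.array([1 - p1, p1])

        pi = {0: 0.5, 1: 0.5}  # uniform intervention policy over X
        causal_entropy = sum(pi[x] * entropy(p_y_given_do_x(x)) for x in (0, 1))
        print("H(Y | do(X)) =", causal_entropy)  # = H(0.1), about 0.469 bits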

    Efficient algorithms for minimax decisions under tree-structured incompleteness

    When decisions must be based on incomplete (coarsened) observations and the coarsening mechanism is unknown, a minimax approach offers the best guarantees on the decision maker’s expected loss. Recent work has derived mathematical conditions characterizing minimax optimal decisions, but also found that computing such decisions is a difficult problem in general. This problem is equivalent to that of maximizing a certain conditional entropy expression. In this work, we present a highly efficient algorithm for the case where the coarsening mechanism can be represented by a tree, whose vertices are outcomes and whose edges are coarse observations.
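
    The efficient tree algorithm itself is not described in this abstract; the brute-force sketch below (an illustration of mine, not the paper's method) only shows the entropy-maximization objective on the smallest tree instance. On a path over three outcomes, the two edges are the possible coarse observations, and the coarsening mechanism has a single free parameter, tuned here to maximize the conditional entropy of the outcome given the observation:

        import numpy as np

        def H(p):
            p = p[p > 0]
            return -np.sum(p * np.log2(p))

        # Path 0 - 1 - 2: coarse observations are the edges {0,1} and {1,2}.
        # Only outcome 1 lies on both edges, so the coarsening mechanism has a
        # single free parameter q = P(report {0,1} | x = 1).
        p_x = np.array([1/3, 1/3, 1/3])  # hypothetical outcome distribution

        def cond_entropy(q):
            joint = np.array([[p_x[0], 0.0],
                              [q * p_x[1], (1 - q) * p_x[1]],
                              [0.0, p_x[2]]])
            p_y = joint.sum(axis=0)
            return sum(p_y[j] * H(joint[:, j] / p_y[j]) for j in range(2))

        qs = np.linspace(0, 1, 1001)
        best_q = qs[np.argmax([cond_entropy(q) for q in qs])]
        print("entropy-maximizing q:", best_q)  # symmetric instance: q = 0.5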